We define the relevant information in a signal $x \in X$ as being the
information that this signal provides about another signal $y \in Y$. Examples
include the information that face images provide about the names of the
people portrayed, or the information that speech sounds provide about the words
spoken. Understanding the signal $x$ requires more than just predicting $y$; it
also requires specifying which features of $X$ play a role in the prediction. We
formalize this problem as that of finding a short code for $X$ that preserves the
maximum information about $Y$. That is, we squeeze the information that $X$
provides about $Y$ through a 'bottleneck' formed by a limited set of codewords
$\tilde{X}$. This constrained optimization problem can be seen as a generalization of
rate distortion theory in which the distortion measure $d(x, \tilde{x})$ emerges from
the joint statistics of $X$ and $Y$. This approach yields an exact set of self
consistent equations for the coding rules $X \to \tilde{X}$ and $\tilde{X} \to Y$. Solutions
to these equations can be found by a convergent re-estimation method that
generalizes the Blahut-Arimoto algorithm. Our variational principle provides
a surprisingly rich framework for discussing a variety of problems in
signal processing and learning, as will be described in detail elsewhere.

1 Introduction



A fundamental problem in formalizing our intuitive ideas about information 
is to provide a quantitative notion of "meaningful" or "relevant" information. 
These issues were intentionally left out of information theory in its original 
formulation by Shannon, who focused attention on the problem of transmit- 
ting information rather than judging its value to the recipient. Correspond- 
ingly, information theory has often been viewed as being strictly a theory 
of communication, and this view has become so accepted that many people 
consider statistical and information theoretic principles as almost irrelevant 
for the question of meaning. In contrast, we argue here that information the- 
ory, in particular lossy source compression, provides a natural quantitative 
approach to the question of "relevant information." Specifically, we formu- 
late a variational principle for the extraction or efficient representation of 
relevant information. In related work we argue that this single information
theoretic principle contains as special cases a wide variety of problems,
including prediction, filtering, and learning in its various forms.

The problem of extracting a relevant summary of data, a compressed 
description that captures only the relevant or meaningful information, is not 
well posed without a suitable definition of relevance. A typical example is 
that of speech compression. One can consider lossless compression, but in 
any compression beyond the entropy of speech some components of the signal 
cannot be reconstructed. On the other hand, a transcript of the spoken words 
has much lower entropy (by orders of magnitude) than the acoustic waveform, 
which means that it is possible to compress (much) further without losing 
any information about the words and their meaning. 

The standard analysis of lossy source compression is "rate distortion
theory," which characterizes the tradeoff between the rate, or signal
representation size, and the average distortion of the reconstructed signal. Rate
distortion theory determines the level of inevitable expected distortion, $D$,
given the desired information rate, $R$, in terms of the rate distortion function
$R(D)$. The main problem with rate distortion theory is in the need to specify
the distortion function first, which in turn determines the relevant features
of the signal. Those features, however, are often not explicitly known and
an arbitrary choice of the distortion function is in fact an arbitrary feature
selection.

In the speech example, we have at best very partial knowledge of what 
precise components of the signal are perceived by our (neural) speech recog- 
nition system. Those relevant components depend not only on the complex 
structure of the auditory nervous system, but also on the sounds and utter- 
ances to which we are exposed during our early life. It therefore is virtually 
impossible to come up with the "correct" distortion function for acoustic 
signals. The same type of difficulty exists in many similar problems, such 
as natural language processing, bioinformatics (for example, what features of 
protein sequences determine their structure) or neural coding (what informa- 
tion is encoded by spike trains and how). This is the fundamental problem 
of feature selection in pattern recognition. Rate distortion theory does not 
provide a full answer to this problem since the choice of the distortion func- 
tion, which determines the relevant features, is not part of the theory. It is, 
however, a step in the right direction. 

A possible solution comes from the fact that in many interesting cases we 
have access to an additional variable that determines what is relevant. In the 
speech case it might be the transcription of the signal, if we are interested 
in the speech recognition problem, or it might be the speaker's identity if 
speaker identification is our goal. For natural language processing, it might 
be the part of speech labels for words in grammar checking, but the dictionary 
senses of ambiguous words in information retrieval. Similarly, for the protein 
folding problem we have a joint database of sequences and three dimensional 
structures, and for neural coding a simultaneous recording of sensory stimuli 
and neural responses defines implicitly the relevant variables in each domain. 
All of these problems have the same formal underlying structure: extract the 
information from one variable that is relevant for the prediction of another 
one. The choice of additional variable determines the relevant components 
or features of the signal in each case. 

In this short paper we formalize this intuitive idea using an informa- 
tion theoretic approach which extends elements of rate distortion theory. 
We derive self consistent equations and an iterative algorithm for finding 
representations of the signal that capture its relevant structure, and prove
convergence of this algorithm.



2 Relevant quantization 

Let $X$ denote the signal (message) space with a fixed probability measure
$p(x)$, and let $\tilde{X}$ denote its quantized codebook or compressed representation.
For ease of exposition we assume here that both of these sets are finite, that
is, a continuous space should first be quantized.

For each value $x \in X$ we seek a possibly stochastic mapping to a
representative, or codeword in a codebook, $\tilde{x} \in \tilde{X}$, characterized by a conditional
p.d.f. $p(\tilde{x}|x)$. The mapping $p(\tilde{x}|x)$ induces a soft partitioning of $X$ in which
each block is associated with one of the codebook elements $\tilde{x} \in \tilde{X}$, with
probability given by

$$p(\tilde{x}) = \sum_{x} p(x)\, p(\tilde{x}|x) . \tag{1}$$

The average volume of the elements of $X$ that are mapped to the same
codeword is $2^{H(X|\tilde{X})}$, where

$$H(X|\tilde{X}) = -\sum_{x} p(x) \sum_{\tilde{x}} p(\tilde{x}|x) \log p(x|\tilde{x}) \tag{2}$$

is the conditional entropy of $X$ given $\tilde{X}$.

What determines the quality of a quantization? The first factor is of
course the rate, or the average number of bits per message needed to specify
an element in the codebook without confusion. This number per element of
$X$ is bounded from below by the mutual information

$$I(X;\tilde{X}) = \sum_{x} \sum_{\tilde{x}} p(x,\tilde{x}) \log \frac{p(\tilde{x}|x)}{p(\tilde{x})} , \tag{3}$$

since the average cardinality of the partitioning of $X$ is given by the ratio of
the volume of $X$ to that of the mean partition, $2^{H(X)}/2^{H(X|\tilde{X})} = 2^{I(X;\tilde{X})}$, via
the standard asymptotic arguments. Notice that this quantity is different
from the entropy of the codebook, $H(\tilde{X})$, and this entropy normally is not
what we want to minimize.

However, information rate alone is not enough to characterize good
quantization since the rate can always be reduced by throwing away details of the
original signal $x$. We therefore need some additional constraints.
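
As a minimal sketch of Eqs. (1) and (3), the rate of a given quantization can
be evaluated directly from $p(x)$ and $p(\tilde{x}|x)$. The function names and the toy
distribution below are illustrative assumptions and do not come from the paper.

```python
import math

def marginal_codebook(p_x, p_c_given_x):
    """Eq. (1): p(x~) = sum_x p(x) p(x~|x)."""
    n_codes = len(p_c_given_x[0])
    return [sum(p_x[i] * p_c_given_x[i][k] for i in range(len(p_x)))
            for k in range(n_codes)]

def rate(p_x, p_c_given_x):
    """Eq. (3): I(X;X~) in bits; zero-probability terms contribute nothing."""
    p_c = marginal_codebook(p_x, p_c_given_x)
    total = 0.0
    for i, px in enumerate(p_x):
        for k, q in enumerate(p_c_given_x[i]):
            if px > 0 and q > 0:
                total += px * q * math.log2(q / p_c[k])
    return total

# Hypothetical toy source: four equiprobable signals, two codewords.
p_x = [0.25, 0.25, 0.25, 0.25]
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # deterministic partition
blind = [[0.5, 0.5]] * 4                                  # assignment ignores x

rate_hard = rate(p_x, hard)    # one bit: the codeword names half of X
rate_blind = rate(p_x, blind)  # zero bits: all details of x are thrown away
```

The second mapping shows why the rate alone cannot measure quality: it reaches
the lowest possible rate while conveying nothing about the signal.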



2.1 Relevance through distortion: 
Rate distortion theory 

In rate distortion theory such a constraint is provided through a distortion
function, $d : X \times \tilde{X} \to \mathcal{R}^{+}$, which is presumed to be small for good
representations. Thus the distortion function specifies implicitly what are the most
relevant aspects of values in $X$.

The partitioning of $X$ induced by the mapping $p(\tilde{x}|x)$ has an expected
distortion

$$\langle d(x,\tilde{x}) \rangle_{p(x,\tilde{x})} = \sum_{x} \sum_{\tilde{x}} p(x,\tilde{x})\, d(x,\tilde{x}) . \tag{4}$$

There is a monotonic tradeoff between the rate of the quantization and the
expected distortion: the larger the rate, the smaller is the achievable
distortion.

The celebrated rate distortion theorem of Shannon and Kolmogorov (see,
for example, Ref. [?]) characterizes this tradeoff through the rate distortion
function, $R(D)$, defined as the minimal achievable rate under a given
constraint on the expected distortion:

$$R(D) \equiv \min_{\{p(\tilde{x}|x) \,:\, \langle d(x,\tilde{x}) \rangle \le D\}} I(X;\tilde{X}) . \tag{5}$$

Finding the rate distortion function is a variational problem that can be
solved by introducing a Lagrange multiplier, $\beta$, for the constrained expected
distortion. One then needs to minimize the functional

$$\mathcal{F}[p(\tilde{x}|x)] = I(X;\tilde{X}) + \beta \langle d(x,\tilde{x}) \rangle_{p(x,\tilde{x})} \tag{6}$$

over all normalized distributions $p(\tilde{x}|x)$. This variational formulation has the
following well known consequences:

Theorem 1 The solution of the variational problem,

$$\frac{\delta \mathcal{F}}{\delta p(\tilde{x}|x)} = 0 , \tag{7}$$

for normalized distributions $p(\tilde{x}|x)$, is given by the exponential form

$$p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x,\beta)} \exp\left[-\beta d(x,\tilde{x})\right] , \tag{8}$$

where $Z(x,\beta)$ is a normalization (partition) function. Moreover, the
Lagrange multiplier $\beta$, determined by the value of the expected distortion, $D$, is
positive and satisfies

$$\frac{\delta R}{\delta D} = -\beta . \tag{9}$$

Proof. Taking the derivative w.r.t. $p(\tilde{x}|x)$, for given $x$ and $\tilde{x}$, one obtains

$$\frac{\delta \mathcal{F}}{\delta p(\tilde{x}|x)} = p(x) \left[ \log \frac{p(\tilde{x}|x)}{p(\tilde{x})} + 1 - \frac{1}{p(\tilde{x})} \sum_{x'} p(x')\, p(\tilde{x}|x') + \beta d(x,\tilde{x}) + \frac{\lambda(x)}{p(x)} \right] , \tag{10}$$

since the marginal distribution satisfies $p(\tilde{x}) = \sum_{x'} p(x')\, p(\tilde{x}|x')$. Here $\lambda(x)$ are
the normalization Lagrange multipliers for each $x$. Setting the derivatives to
zero and writing $\log Z(x,\beta) = \lambda(x)/p(x)$, we obtain Eq. (8). When varying
the normalized $p(\tilde{x}|x)$, the variations $\delta I(X;\tilde{X})$ and $\delta \langle d(x,\tilde{x}) \rangle_{p(x,\tilde{x})}$ are linked
through

$$\delta \mathcal{F} = \delta I(X;\tilde{X}) + \beta\, \delta \langle d(x,\tilde{x}) \rangle = 0 , \tag{11}$$

from which Eq. (9) follows. The positivity of $\beta$ is then a consequence of
the fact that the rate distortion function is a non-increasing, convex function
of $D$ (see, for example, Chapter 13 of Ref. [?]). □

2.2 The Blahut-Arimoto algorithm

An important practical consequence of the above variational formulation is
that it provides a converging iterative algorithm for self consistent
determination of the distributions $p(\tilde{x}|x)$ and $p(\tilde{x})$.

Equations (1) and (8) must be satisfied simultaneously for consistent
probability assignment. A natural approach to solve these equations is to
alternately iterate between them until reaching convergence. The following
lemma, due to Csiszár and Tusnády [?], assures global convergence in this
case.

Lemma 2 Let $p(x,y) = p(x)p(y|x)$ be a given joint distribution. Then the
distribution $q(y)$ that minimizes the relative entropy or Kullback-Leibler
divergence, $D_{KL}[p(x,y)|p(x)q(y)]$, is the marginal distribution

$$p(y) = \sum_{x} p(x)\, p(y|x) .$$

Namely,

$$I(X;Y) = D_{KL}[p(x,y)|p(x)p(y)] = \min_{q(y)} D_{KL}[p(x,y)|p(x)q(y)] .$$

Equivalently, the distribution $q(y)$ which minimizes the expected relative
entropy,

$$\sum_{x} p(x)\, D_{KL}[p(y|x)|q(y)] ,$$

is also the marginal distribution $p(y) = \sum_{x} p(x)\, p(y|x)$.

The proof follows directly from the non-negativity of the relative entropy. 
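
The second form of Lemma 2 is easy to check numerically. The sketch below
scans a grid of candidate distributions $q(y)$ on a binary $Y$ and compares the
expected relative entropy with its value at the marginal; the distributions
and helper names are hypothetical choices made only for this illustration.

```python
import math

def kl(p, q):
    """D_KL[p|q] in nats for two distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_kl(p_x, p_y_given_x, q_y):
    """sum_x p(x) D_KL[p(y|x) | q(y)]."""
    return sum(px * kl(row, q_y) for px, row in zip(p_x, p_y_given_x))

# Hypothetical joint distribution, given as p(x) and p(y|x).
p_x = [0.5, 0.3, 0.2]
p_y_given_x = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]

# Marginal p(y) = sum_x p(x) p(y|x)
p_y = [sum(px * row[j] for px, row in zip(p_x, p_y_given_x)) for j in range(2)]
at_marginal = expected_kl(p_x, p_y_given_x, p_y)

# Every other q(y) = (a, 1 - a) on the grid should do no better.
grid = [[a / 100.0, 1.0 - a / 100.0] for a in range(1, 100)]
smallest_gap = min(expected_kl(p_x, p_y_given_x, q) - at_marginal for q in grid)
```

The gap equals $D_{KL}[p(y)|q(y)]$, which is why it cannot be negative.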

This lemma guarantees the marginal condition Eq. (1) through the same
variational principle that leads to Eq. (8):

Theorem 3 Equations (1) and (8) are satisfied simultaneously at the
minimum of the functional,

$$\mathcal{F} = -\langle \log Z(x,\beta) \rangle_{p(x)} = I(X;\tilde{X}) + \beta \langle d(x,\tilde{x}) \rangle_{p(x,\tilde{x})} , \tag{12}$$

where the minimization is done independently over the convex sets of the
normalized distributions, $\{p(\tilde{x})\}$ and $\{p(\tilde{x}|x)\}$,

$$\min_{p(\tilde{x})} \min_{p(\tilde{x}|x)} \mathcal{F}[p(\tilde{x}); p(\tilde{x}|x)] .$$

These independent conditions correspond precisely to alternating iterations
of Eq. (1) and Eq. (8). Denoting by $t$ the iteration step,

$$\begin{aligned}
p_{t+1}(\tilde{x}) &= \sum_{x} p(x)\, p_t(\tilde{x}|x) \\
p_t(\tilde{x}|x) &= \frac{p_t(\tilde{x})}{Z_t(x,\beta)} \exp\left(-\beta d(x,\tilde{x})\right)
\end{aligned} \tag{13}$$

where the normalization function $Z_t(x,\beta)$ is evaluated for every $t$ in Eq.
(13). Furthermore, these iterations converge to a unique minimum of $\mathcal{F}$ in
the convex sets of the two distributions.

For the proof, see Refs. [?]. This alternating iteration is the well
known Blahut-Arimoto (BA) algorithm for calculation of the rate distortion
function.
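
The alternating iteration can be sketched in a few lines. The code below is an
illustrative pure-Python version for a fixed distortion matrix; the uniform
starting point, the fixed iteration count, and all names are assumptions of
this sketch rather than details taken from the paper.

```python
import math

def blahut_arimoto(p_x, d, beta, n_iter=200):
    """Alternate Eq. (8) and Eq. (1) for a fixed distortion matrix d[x][x~]."""
    n_x, n_c = len(p_x), len(d[0])
    p_c = [1.0 / n_c] * n_c  # initial p(x~): uniform (an arbitrary choice)
    for _ in range(n_iter):
        # Eq. (8): p(x~|x) = p(x~) exp(-beta d(x,x~)) / Z(x,beta)
        p_c_given_x = []
        for i in range(n_x):
            w = [p_c[k] * math.exp(-beta * d[i][k]) for k in range(n_c)]
            z = sum(w)  # partition function Z(x,beta)
            p_c_given_x.append([wk / z for wk in w])
        # Eq. (1): p(x~) = sum_x p(x) p(x~|x)
        p_c = [sum(p_x[i] * p_c_given_x[i][k] for i in range(n_x))
               for k in range(n_c)]
    return p_c_given_x, p_c

def rate_and_distortion(p_x, d, p_c_given_x, p_c):
    """Return (I(X;X~) in nats, expected distortion <d>)."""
    rate = dist = 0.0
    for i, px in enumerate(p_x):
        for k, q in enumerate(p_c_given_x[i]):
            if q > 0:
                rate += px * q * math.log(q / p_c[k])
            dist += px * q * d[i][k]
    return rate, dist

# Binary equiprobable source with Hamming distortion (a standard test case).
p_x = [0.5, 0.5]
hamming = [[0.0, 1.0], [1.0, 0.0]]
results = [rate_and_distortion(p_x, hamming, *blahut_arimoto(p_x, hamming, b))
           for b in (0.5, 2.0, 8.0)]
```

Each value of $\beta$ yields one point on the rate distortion curve; for this source
the points should fall on the known curve $R(D) = \log 2 - H(D)$ in nats.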

It is important to notice that the BA algorithm deals only with the
optimal partitioning of the set $X$ given the set of representatives $\tilde{X}$, and not
with an optimal choice of the representation $\tilde{X}$. In practice, for finite data,
it is also important to find the optimal representatives which minimize the
expected distortion, given the partitioning. This joint optimization is similar
to the EM procedure in statistical estimation and does not in general have a
unique solution.

3 Relevance through another variable: 
The Information Bottleneck 

Since the "right" distortion measure is rarely available, the problem of
relevant quantization has to be addressed directly, by preserving the relevant
information about another variable. The relevance variable, denoted here by
$Y$, must not be independent from the original signal $X$, namely they have
positive mutual information $I(X;Y)$. It is assumed here that we have access
to the joint distribution $p(x,y)$, which is part of the setup of the problem,
similarly to $p(x)$ in the rate distortion case.^1

^1 The problem of actually obtaining a good enough sample of this distribution is an
interesting issue in learning theory, but is beyond the scope of this paper. For a start on
this problem see Ref. [?].

3.1 A new variational principle

As before, we would like our relevant quantization $\tilde{X}$ to compress $X$ as much
as possible. In contrast to the rate distortion problem, however, we now want
this quantization to capture as much of the information about $Y$ as possible.
The amount of information about $Y$ in $\tilde{X}$ is given by

$$I(\tilde{X};Y) = \sum_{y} \sum_{\tilde{x}} p(y,\tilde{x}) \log \frac{p(y,\tilde{x})}{p(y)\, p(\tilde{x})} \le I(X;Y) . \tag{14}$$

Obviously lossy compression cannot convey more information than the
original data. As with rate and distortion, there is a tradeoff between
compressing the representation and preserving meaningful information, and there is
no single right solution for the tradeoff. The assignment we are looking for is
the one that keeps a fixed amount of meaningful information about the
relevant signal $Y$ while minimizing the number of bits from the original signal
$X$ (maximizing the compression).^2 In effect we pass the information that $X$
provides about $Y$ through a "bottleneck" formed by the compact summaries
in $\tilde{X}$.

We can find the optimal assignment by minimizing the functional

$$\mathcal{L}[p(\tilde{x}|x)] = I(\tilde{X};X) - \beta I(\tilde{X};Y) , \tag{15}$$

where $\beta$ is the Lagrange multiplier attached to the constrained meaningful
information, while maintaining the normalization of the mapping $p(\tilde{x}|x)$ for
every $x$. At $\beta = 0$ our quantization is the most sketchy possible (everything
is assigned to a single point), while as $\beta \to \infty$ we are pushed toward
arbitrarily detailed quantization. By varying the (only) parameter $\beta$ one can explore
the tradeoff between the preserved meaningful information and compression
at various resolutions. As we show elsewhere [?], for interesting special
cases (where there exist sufficient statistics) it is possible to preserve almost
all the meaningful information at finite $\beta$ with a significant compression of
the original data.

^2 It is completely equivalent to maximize the meaningful information for a fixed
compression of the original variable.
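
To make the tradeoff in Eq. (15) concrete, the following sketch evaluates the
functional for a given assignment $p(\tilde{x}|x)$ on a small joint table. The table,
the two candidate assignments, and the function names are hypothetical and
serve only as an illustration.

```python
import math

def mutual_information(p_joint):
    """I(A;B) in nats from a joint table p_joint[a][b]."""
    p_a = [sum(row) for row in p_joint]
    p_b = [sum(col) for col in zip(*p_joint)]
    return sum(p * math.log(p / (p_a[a] * p_b[b]))
               for a, row in enumerate(p_joint)
               for b, p in enumerate(row) if p > 0)

def ib_functional(p_xy, p_c_given_x, beta):
    """Eq. (15): return (L, I(X~;X), I(X~;Y)) for a given mapping p(x~|x)."""
    n_x, n_y, n_c = len(p_xy), len(p_xy[0]), len(p_c_given_x[0])
    p_x = [sum(row) for row in p_xy]
    # p(x, x~) = p(x) p(x~|x)
    p_xc = [[p_x[i] * q for q in p_c_given_x[i]] for i in range(n_x)]
    # p(x~, y) = sum_x p(x~|x) p(x, y), using the Markov chain X~ <- X <- Y
    p_cy = [[sum(p_c_given_x[i][k] * p_xy[i][j] for i in range(n_x))
             for j in range(n_y)] for k in range(n_c)]
    i_x = mutual_information(p_xc)
    i_y = mutual_information(p_cy)
    return i_x - beta * i_y, i_x, i_y

# Hypothetical joint table: x = 0, 1 favor y = 0 and x = 2, 3 favor y = 1.
p_xy = [[0.20, 0.05], [0.20, 0.05], [0.05, 0.20], [0.05, 0.20]]
merged = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # two codewords
single = [[1.0]] * 4                                        # one codeword

L_merged, ix_merged, iy_merged = ib_functional(p_xy, merged, beta=5.0)
L_single, ix_single, iy_single = ib_functional(p_xy, single, beta=5.0)
```

In this example the two-codeword assignment keeps all of $I(X;Y)$ at half the
rate needed to name $x$ exactly, because signals with identical $p(y|x)$ are
merged, whereas the single codeword keeps nothing.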
3.2 Self-consistent equations

Unlike the case of rate distortion theory, here the constraint on the
meaningful information is nonlinear in the desired mapping $p(\tilde{x}|x)$ and this is a much
harder variational problem. Perhaps surprisingly, this general problem of
extracting the meaningful information, that is, minimizing the functional
$\mathcal{L}[p(\tilde{x}|x)]$ in Eq. (15), can be given an exact formal solution.

Theorem 4 The optimal assignment, that minimizes Eq. (15), satisfies the
equation

$$p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x,\beta)} \exp\left[ -\beta \sum_{y} p(y|x) \log \frac{p(y|x)}{p(y|\tilde{x})} \right] , \tag{16}$$

where the distribution $p(y|\tilde{x})$ in the exponent is given via Bayes' rule and the
Markov chain condition $\tilde{X} \leftarrow X \leftarrow Y$, as

$$p(y|\tilde{x}) = \frac{1}{p(\tilde{x})} \sum_{x} p(y|x)\, p(\tilde{x}|x)\, p(x) . \tag{17}$$

This solution has a number of interesting features, but we must emphasize
that it is a formal solution since $p(y|\tilde{x})$ in the exponential is defined implicitly
in terms of the assignment mapping $p(\tilde{x}|x)$. Just as before, the marginal
distribution $p(\tilde{x})$ must satisfy the marginal condition Eq. (1) for consistency.
Proof. First we note that the conditional distribution of $y$ on $\tilde{x}$,

$$p(y|\tilde{x}) = \sum_{x} p(y|x)\, p(x|\tilde{x}) , \tag{18}$$

follows from the Markov chain condition $Y \leftarrow X \leftarrow \tilde{X}$.^3 The only
variational variables in this scheme are the conditional distributions, $p(\tilde{x}|x)$, since
other unknown distributions are determined from it through Bayes' rule and
consistency. Thus we have

$$p(\tilde{x}) = \sum_{x} p(\tilde{x}|x)\, p(x) , \tag{19}$$

and

$$p(\tilde{x}|y) = \sum_{x} p(\tilde{x}|x)\, p(x|y) . \tag{20}$$

The above equations imply the following derivatives w.r.t. $p(\tilde{x}|x)$,

$$\frac{\delta p(\tilde{x})}{\delta p(\tilde{x}|x)} = p(x) \tag{21}$$

and

$$\frac{\delta p(\tilde{x}|y)}{\delta p(\tilde{x}|x)} = p(x|y) . \tag{22}$$

^3 It is important to notice that this is not a modeling assumption and the quantization $\tilde{X}$
is not used as a hidden variable in a model of the data. In the latter, the Markov condition
would have been different: $Y \leftarrow \tilde{X} \leftarrow X$.


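Introducing Lagrange multipliers, $\beta$ for the information constraint and $\lambda(x)$
for the normalization of the conditional distributions $p(\tilde{x}|x)$ at each $x$, the
Lagrangian, Eq. (15), becomes

$$\mathcal{L} = I(X,\tilde{X}) - \beta I(\tilde{X},Y) - \sum_{x,\tilde{x}} \lambda(x)\, p(\tilde{x}|x) \tag{23}$$

$$= \sum_{x,\tilde{x}} p(\tilde{x}|x)\, p(x) \log \frac{p(\tilde{x}|x)}{p(\tilde{x})} - \beta \sum_{\tilde{x},y} p(\tilde{x},y) \log \frac{p(\tilde{x}|y)}{p(\tilde{x})} - \sum_{x,\tilde{x}} \lambda(x)\, p(\tilde{x}|x) . \tag{24}$$
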

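Taking derivatives with respect to $p(\tilde{x}|x)$ for given $x$ and $\tilde{x}$, one obtains

$$\begin{aligned}
\frac{\delta \mathcal{L}}{\delta p(\tilde{x}|x)} ={}& p(x) \left[ 1 + \log p(\tilde{x}|x) \right] - \frac{\delta p(\tilde{x})}{\delta p(\tilde{x}|x)} \left[ 1 + \log p(\tilde{x}) \right] \\
& - \beta \sum_{y} \frac{\delta p(\tilde{x}|y)}{\delta p(\tilde{x}|x)}\, p(y) \left[ 1 + \log p(\tilde{x}|y) \right] \\
& + \beta \frac{\delta p(\tilde{x})}{\delta p(\tilde{x}|x)} \left[ 1 + \log p(\tilde{x}) \right] - \lambda(x) .
\end{aligned} \tag{25}$$

Substituting the derivatives from Eqs. (21) and (22) and rearranging, using
$p(x|y)\,p(y) = p(y|x)\,p(x)$ and $p(\tilde{x}|y)/p(\tilde{x}) = p(y|\tilde{x})/p(y)$,

$$\frac{\delta \mathcal{L}}{\delta p(\tilde{x}|x)} = p(x) \left\{ \log \frac{p(\tilde{x}|x)}{p(\tilde{x})} - \beta \sum_{y} p(y|x) \log \frac{p(y|\tilde{x})}{p(y)} - \frac{\lambda(x)}{p(x)} \right\} . \tag{26}$$

Notice that $\sum_{y} p(y|x) \log \frac{p(y|x)}{p(y)} = I(x;Y)$ is a function of $x$ only (independent
of $\tilde{x}$) and thus can be absorbed into the multiplier $\lambda(x)$. Introducing

$$\tilde{\lambda}(x) = \frac{\lambda(x)}{p(x)} + \beta \sum_{y} p(y|x) \log \frac{p(y|x)}{p(y)} ,$$
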
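we finally obtain the variational condition:

$$\frac{\delta \mathcal{L}}{\delta p(\tilde{x}|x)} = p(x) \left[ \log \frac{p(\tilde{x}|x)}{p(\tilde{x})} + \beta \sum_{y} p(y|x) \log \frac{p(y|x)}{p(y|\tilde{x})} - \tilde{\lambda}(x) \right] = 0 ,$$

which is equivalent to equation (16) for $p(\tilde{x}|x)$,

$$p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x,\beta)} \exp\left( -\beta D_{KL}[p(y|x)|p(y|\tilde{x})] \right) , \tag{27}$$
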

with

$$Z(x,\beta) = \exp[-\tilde{\lambda}(x)] = \sum_{\tilde{x}} p(\tilde{x}) \exp\left( -\beta D_{KL}[p(y|x)|p(y|\tilde{x})] \right) \tag{28}$$

the normalization (partition) function. □

Comments:


1. The Kullback-Leibler divergence, $D_{KL}[p(y|x)|p(y|\tilde{x})]$, emerged as the
relevant "effective distortion measure" from our variational principle
but is not assumed otherwise anywhere! It is therefore natural to
consider it as the "correct" distortion $d(x,\tilde{x}) = D_{KL}[p(y|x)|p(y|\tilde{x})]$ for
quantization in the information bottleneck setting.

2. Equation (16), together with equations (18) and (1), determine self
consistently the desired conditional distributions $p(\tilde{x}|x)$ and $p(\tilde{x})$. The
crucial quantization is here performed through the conditional
distributions $p(y|\tilde{x})$, and the self consistent equations include also the
optimization over the representatives, in contrast to rate distortion theory,
where the selection of representatives is a separate problem.
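
The effective distortion of Comment 1 can be computed explicitly once an
assignment is given. The sketch below builds $p(y|\tilde{x})$ from Eq. (17) and then
the matrix $d(x,\tilde{x}) = D_{KL}[p(y|x)|p(y|\tilde{x})]$; the joint table, the soft
assignment, and the function names are hypothetical choices for illustration.

```python
import math

def decoder(p_xy, p_c_given_x):
    """Eq. (17): p(y|x~) = (1/p(x~)) sum_x p(y|x) p(x~|x) p(x)."""
    n_x, n_y, n_c = len(p_xy), len(p_xy[0]), len(p_c_given_x[0])
    out = []
    for k in range(n_c):
        # p(x~) = sum_x p(x~|x) p(x), with p(x) = sum_y p(x,y)
        p_c = sum(p_c_given_x[i][k] * sum(p_xy[i]) for i in range(n_x))
        # p(y|x) p(x) = p(x,y), so the numerator uses the joint table directly
        out.append([sum(p_c_given_x[i][k] * p_xy[i][j] for i in range(n_x)) / p_c
                    for j in range(n_y)])
    return out

def effective_distortion(p_xy, p_c_given_x):
    """Matrix d[x][x~] = D_KL[p(y|x) | p(y|x~)] in nats."""
    p_y_given_c = decoder(p_xy, p_c_given_x)
    d = []
    for row in p_xy:
        p_x = sum(row)
        p_y_given_x = [p / p_x for p in row]
        d.append([sum(a * math.log(a / b)
                      for a, b in zip(p_y_given_x, q) if a > 0)
                  for q in p_y_given_c])
    return d

# Hypothetical joint table and soft assignment to two codewords.
p_xy = [[0.20, 0.05], [0.20, 0.05], [0.05, 0.20], [0.05, 0.20]]
soft = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
d = effective_distortion(p_xy, soft)
```

Note that the distortion matrix depends on the assignment itself through
$p(y|\tilde{x})$, which is what makes the equations self consistent rather than a
plain rate distortion problem.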
3.3 The information bottleneck iterative algorithm

As for the BA algorithm, the self consistent equations (16) and (17) suggest
a natural method for finding the unknown distributions, at every value of $\beta$.
Indeed, these equations can be turned into converging, alternating iterations
among the three convex distribution sets, $\{p(\tilde{x}|x)\}$, $\{p(\tilde{x})\}$, and $\{p(y|\tilde{x})\}$, as
stated in the following theorem.



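Theorem 5 The self consistent equations (16), (1), and (18) are satisfied
simultaneously at the minima of the functional,

$$\mathcal{F}[p(\tilde{x}|x); p(\tilde{x}); p(y|\tilde{x})] \equiv -\langle \log Z(x,\beta) \rangle_{p(x)} \tag{29}$$

$$= I(X;\tilde{X}) + \beta \langle D_{KL}[p(y|x)|p(y|\tilde{x})] \rangle_{p(x,\tilde{x})} , \tag{30}$$

where the minimization is done independently over the convex sets of the
normalized distributions, $\{p(\tilde{x})\}$ and $\{p(\tilde{x}|x)\}$ and $\{p(y|\tilde{x})\}$. Namely,

$$\min_{p(y|\tilde{x})} \min_{p(\tilde{x})} \min_{p(\tilde{x}|x)} \mathcal{F}[p(\tilde{x}|x); p(\tilde{x}); p(y|\tilde{x})] .$$

This minimization is performed by the converging alternating iterations.
Denoting by $t$ the iteration step,

$$\begin{aligned}
p_t(\tilde{x}|x) &= \frac{p_t(\tilde{x})}{Z_t(x,\beta)} \exp\left(-\beta d(x,\tilde{x})\right) \\
p_{t+1}(\tilde{x}) &= \sum_{x} p(x)\, p_t(\tilde{x}|x) \\
p_{t+1}(y|\tilde{x}) &= \sum_{x} p(y|x)\, p_t(x|\tilde{x})
\end{aligned} \tag{31}$$

and the normalization (partition function) $Z_t(\beta,x)$ is evaluated for every $t$
in Eq. (31).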
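
A minimal sketch of the alternating iterations of Eq. (31) is given below,
with $d(x,\tilde{x})$ taken to be the Kullback-Leibler divergence between $p(y|x)$
and $p(y|\tilde{x})$. The fixed iteration count, the requirement of a strictly positive
joint table, the toy data, and the starting assignment are all assumptions of
this illustration.

```python
import math

def ib_iterations(p_xy, q_init, beta, n_iter=300):
    """Alternating updates of Eq. (31); p_xy[x][y] must be strictly positive."""
    n_x, n_y, n_c = len(p_xy), len(p_xy[0]), len(q_init[0])
    p_x = [sum(row) for row in p_xy]
    p_y_x = [[p / p_x[i] for p in p_xy[i]] for i in range(n_x)]

    def marginal(q):      # p(x~) = sum_x p(x) p(x~|x)
        return [sum(p_x[i] * q[i][k] for i in range(n_x)) for k in range(n_c)]

    def decoder(q, p_c):  # p(y|x~) = sum_x p(y|x) p(x|x~)
        return [[sum(p_xy[i][j] * q[i][k] for i in range(n_x)) / p_c[k]
                 for j in range(n_y)] for k in range(n_c)]

    q = [row[:] for row in q_init]
    p_c = marginal(q)
    p_y_c = decoder(q, p_c)
    for _ in range(n_iter):
        # p_t(x~|x) = p_t(x~) exp(-beta d(x,x~)) / Z_t(x,beta)
        for i in range(n_x):
            w = []
            for k in range(n_c):
                d = sum(a * math.log(a / b) for a, b in zip(p_y_x[i], p_y_c[k]))
                w.append(p_c[k] * math.exp(-beta * d))
            z = sum(w)
            q[i] = [wk / z for wk in w]
        p_c = marginal(q)        # p_{t+1}(x~)
        p_y_c = decoder(q, p_c)  # p_{t+1}(y|x~)
    return q, p_c, p_y_c

# Hypothetical joint table and a slightly asymmetric starting assignment.
p_xy = [[0.20, 0.05], [0.20, 0.05], [0.05, 0.20], [0.05, 0.20]]
q0 = [[0.7, 0.3], [0.5, 0.5], [0.4, 0.6], [0.5, 0.5]]
q_high, p_c_high, _ = ib_iterations(p_xy, q0, beta=10.0)
q_low, p_c_low, _ = ib_iterations(p_xy, q0, beta=1.0)
```

For this toy table a large $\beta$ drives the assignment toward a nearly hard
split of the two groups of signals, while a small $\beta$ collapses both
codewords onto the same $p(y|\tilde{x})$, so the assignment no longer depends on $x$.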



Proof. For lack of space we can only outline the proof. First we show that
the equations indeed are satisfied at the minima of the functional $\mathcal{F}$ (known
to physicists as the "free energy"). This follows from Lemma 2 when applied
to $I(X;\tilde{X})$ with the convex sets of $p(\tilde{x})$ and $p(\tilde{x}|x)$, as for the BA algorithm.
Then the second part of the lemma is applied to $\langle D_{KL}[p(y|x)|p(y|\tilde{x})] \rangle_{p(x,\tilde{x})}$,
which is an expected relative entropy. Equation (18) minimizes the expected
relative entropy w.r.t. variations in the convex set of the normalized
$\{p(y|\tilde{x})\}$. Denoting by $d(x,\tilde{x}) = D_{KL}[p(y|x)|p(y|\tilde{x})]$ and by $\lambda(\tilde{x})$ the
normalization Lagrange multipliers, we obtain

$$\delta d(x,\tilde{x}) = \delta \left( -\sum_{y} p(y|x) \log p(y|\tilde{x}) + \lambda(\tilde{x}) \Big( \sum_{y} p(y|\tilde{x}) - 1 \Big) \right) \tag{32}$$

$$= \sum_{y} \left( -\frac{p(y|x)}{p(y|\tilde{x})} + \lambda(\tilde{x}) \right) \delta p(y|\tilde{x}) . \tag{33}$$

The expected relative entropy becomes,

$$\sum_{y} \left( -\sum_{x} \frac{p(y|x)\, p(x|\tilde{x})}{p(y|\tilde{x})} + \lambda(\tilde{x}) \right) \delta p(y|\tilde{x}) = 0 , \tag{34}$$

which gives Eq. (18), since $\delta p(y|\tilde{x})$ are independent for each $\tilde{x}$. Equation
(18) also has the interpretation of a weighted average of the data conditional
distributions that contribute to the representative $\tilde{x}$.

To prove the convergence of the iterations it is enough to verify that
each of the iteration steps minimizes the same functional, independently,
and that this functional is bounded from below as a sum of two non-negative
terms. The only point to notice is that when $p(y|\tilde{x})$ is fixed we are back to
the rate distortion case with fixed distortion matrix $d(x,\tilde{x})$. The argument
in [?] for the BA algorithm applies here as well. On the other hand we
have just shown that the third equation minimizes the expected relative
entropy without affecting the mutual information $I(X;\tilde{X})$. This proves the
convergence of the alternating iterations. However, the situation here is
similar to the EM algorithm and the functional $\mathcal{F}[p(\tilde{x}|x); p(\tilde{x}); p(y|\tilde{x})]$ is
convex in each of the distributions independently but not in the product space
of these distributions. Thus our convergence proof does not imply uniqueness
of the solution. □



3.4 The structure of the solutions 

The formal solution of the self consistent equations, described above, still
requires a specification of the structure and cardinality of $\tilde{X}$, as in rate
distortion theory. For every value of the Lagrange multiplier $\beta$ there are
corresponding values of the mutual information $I_X = I(X,\tilde{X})$, and $I_Y =
I(\tilde{X},Y)$ for every choice of the cardinality of $\tilde{X}$. The variational principle
implies that

$$\frac{\delta I(\tilde{X},Y)}{\delta I(X,\tilde{X})} = \beta^{-1} > 0 , \tag{35}$$

which suggests a deterministic annealing approach. By increasing the value
of $\beta$ one can move along convex curves in the "information plane" $(I_X, I_Y)$.
These curves, analogous to the rate distortion curves, exist for every choice
of the cardinality of $\tilde{X}$. The solutions of the self consistent equations thus
correspond to a family of such annealing curves, all starting from the (trivial)
point (0, 0) in the information plane with infinite slope and parameterized by
$\beta$. Interestingly, every two curves in this family separate (bifurcate) at some
finite (critical) $\beta$ through a second order phase transition. These transitions
form a hierarchy of relevant quantizations for different cardinalities of $\tilde{X}$, as
described in [?].
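
Points in the information plane can be traced numerically by solving the self
consistent equations at several values of $\beta$. The sketch below is not a
deterministic annealing schedule: for simplicity it restarts from the same
hypothetical initial assignment at every $\beta$, with a fixed cardinality of two
codewords, and the toy table and names are assumptions of this illustration.

```python
import math

def mutual_information(p_joint):
    """I(A;B) in nats from a joint table p_joint[a][b]."""
    p_a = [sum(row) for row in p_joint]
    p_b = [sum(col) for col in zip(*p_joint)]
    return sum(p * math.log(p / (p_a[a] * p_b[b]))
               for a, row in enumerate(p_joint)
               for b, p in enumerate(row) if p > 0)

def ib_point(p_xy, q_init, beta, n_iter=500):
    """Iterate the self consistent equations and return (I_X, I_Y) in nats."""
    n_x, n_y, n_c = len(p_xy), len(p_xy[0]), len(q_init[0])
    p_x = [sum(row) for row in p_xy]
    q = [row[:] for row in q_init]
    for _ in range(n_iter):
        # p(x~) and p(y|x~) from the current assignment p(x~|x)
        p_c = [sum(p_x[i] * q[i][k] for i in range(n_x)) for k in range(n_c)]
        p_y_c = [[sum(p_xy[i][j] * q[i][k] for i in range(n_x)) / p_c[k]
                  for j in range(n_y)] for k in range(n_c)]
        # p(x~|x) proportional to p(x~) exp(-beta D_KL[p(y|x) | p(y|x~)])
        for i in range(n_x):
            w = [p_c[k] * math.exp(-beta * sum(
                     (p / p_x[i]) * math.log((p / p_x[i]) / p_y_c[k][j])
                     for j, p in enumerate(p_xy[i])))
                 for k in range(n_c)]
            z = sum(w)
            q[i] = [wk / z for wk in w]
    p_xc = [[p_x[i] * q[i][k] for k in range(n_c)] for i in range(n_x)]
    p_cy = [[sum(q[i][k] * p_xy[i][j] for i in range(n_x))
             for j in range(n_y)] for k in range(n_c)]
    return mutual_information(p_xc), mutual_information(p_cy)

# Hypothetical joint table (strictly positive) and starting assignment.
p_xy = [[0.20, 0.05], [0.20, 0.05], [0.05, 0.20], [0.05, 0.20]]
q0 = [[0.7, 0.3], [0.5, 0.5], [0.4, 0.6], [0.5, 0.5]]
betas = [1.0, 2.0, 4.0, 10.0]
curve = [ib_point(p_xy, q0, b) for b in betas]  # list of (I_X, I_Y) pairs
```

For this table the small values of $\beta$ stay at the trivial point (0, 0), and
the solution leaves it only above a critical $\beta$, in line with the bifurcation
picture described above.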

Further work

The most fascinating aspect of the information bottleneck principle is that it
provides a unified framework for different information processing problems,
including prediction, filtering and learning. There are already several
successful applications of this method to various "real" problems, such as
semantic clustering of English words [?], document classification [?], neural
coding, and spectral analysis.